## Business problem {: #business-problem }

A key pillar of any AML compliance program is to monitor transactions for suspicious activity. The scope of transactions is broad, including deposits, withdrawals, fund transfers, purchases, merchant credits, and payments. Typically, monitoring starts with a rules-based system that scans customer transactions for red flags consistent with money laundering. When a transaction matches a predetermined rule, an alert is generated and the case is referred to the bank’s internal investigation team for manual review. If the investigators conclude the behavior is indicative of money laundering, then the bank will file a Suspicious Activity Report (SAR) with FinCEN.

Unfortunately, the standard transaction monitoring system described above has costly drawbacks. In particular, the rate of false positives (cases incorrectly flagged as suspicious) generated by this rules-based system can reach 90% or more. Because the system is rules-based and rigid, it cannot dynamically learn the complex interactions and behaviors behind money laundering. The prevalence of false positives makes investigators less efficient, as they must manually weed out cases that the rules-based system incorrectly marked as suspicious.

Compliance teams at financial institutions can have hundreds or even thousands of investigators, and the current systems prevent investigators from becoming more effective and efficient in their investigations. The cost of reviewing a single alert is `$30~$70`. For a bank that receives 100,000 alerts a year, this is a substantial sum. In addition, penalties imposed for proven money laundering average `$145` million per case. A reduction in false positives could result in savings of `$600,000~$4.2` million per year.

## Solution value {: #solution-value }

This use case builds a model that dynamically learns patterns in complex data and reduces false positive alerts. Financial crime compliance teams can then prioritize the alerts that legitimately require manual review and dedicate more resources to those cases most likely to be suspicious. By learning from historical data to uncover patterns related to money laundering, AI also helps identify which customer data and transaction activities are indicative of a high risk for potential money laundering.

The primary issues and corresponding opportunities that this use case addresses include:

Issue | Opportunity
:- | :-
Potential regulatory fine  | Mitigate the risk of missing suspicious activities due to lack of competency with alert investigations. Use alert scores to more effectively assign alerts&mdash;high risk alerts to more experienced investigators, low risk alerts to more junior team members.
Investigation productivity | Increase investigators' productivity by making the review process more effective and efficient, and by providing a more holistic view when assessing cases.

Specifically:

* **Strategy/challenge**:  Help investigators focus their attention on cases that have the highest risk of money laundering while minimizing the time they spend reviewing false-positive cases.

    For banks with large volumes of daily transactions, improvements in the effectiveness and efficiency of their investigations ultimately result in fewer cases of money laundering that go unnoticed. This allows banks to enhance their regulatory compliance and reduce the volume of financial crime present within their network.

* **Business driver**: Improve the efficiency of AML transaction monitoring and lower operational costs.

    With its ability to dynamically learn patterns in complex data, AI significantly improves accuracy in predicting which cases will result in a SAR filing. AI models for anti-money laundering can be deployed into the review process to score and rank all new cases.

* **Model solution**: Assign a suspicious activity score to each AML alert, improving the efficiency of an AML compliance program.

    Any case that exceeds a predetermined threshold of risk is sent to the investigators for manual review. Meanwhile, any case that falls below the threshold can be automatically discarded or sent to a lighter review. Once AI models are deployed into production, they can be continuously retrained on new data to capture any novel behaviors of money laundering. This data will come from the feedback of investigators.

    Specifically, the model will use rules that trigger an alert whenever a customer requests a refund of any amount since small refund requests could be the money launderer’s way of testing the refund mechanism or trying to establish refund requests as a normal pattern for their account.
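
The threshold-based routing described under **Model solution** can be sketched as follows. This is a minimal illustration, not the product's implementation: the `ScoredAlert` class, the `triage` function, the alert IDs, and the `0.05` threshold are all hypothetical, and a real threshold would be chosen from validation data.

```python
from dataclasses import dataclass


@dataclass
class ScoredAlert:
    alert_id: str  # hypothetical identifier
    score: float   # model-estimated probability that the alert becomes a SAR


def triage(alerts, threshold):
    """Split alerts into a manual-review queue and a lighter-review queue."""
    manual, light = [], []
    for alert in alerts:
        # Only alerts that exceed the threshold go to full manual review.
        (manual if alert.score > threshold else light).append(alert)
    # Highest-risk alerts first, so they can be assigned to senior investigators.
    manual.sort(key=lambda a: a.score, reverse=True)
    return manual, light


alerts = [
    ScoredAlert("A-1", 0.30),
    ScoredAlert("A-2", 0.01),
    ScoredAlert("A-3", 0.82),
]
# The threshold here is an illustrative assumption only.
manual_queue, light_queue = triage(alerts, threshold=0.05)
print([a.alert_id for a in manual_queue])
print([a.alert_id for a in light_queue])
```

Sorting the manual queue by score also supports the assignment idea from the opportunity table: the top of the queue can go to more experienced investigators.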

The following table summarizes aspects of this use case.

Topic | Description
:- | :-
**Use case type** | Anti-money laundering (false positive reduction)
**Target audience** | Data Scientist, Financial Crime Compliance Team
**Desired outcomes**| <ul><li>Identify which customer data and transaction activity are indicative of a high risk for potential money laundering.</li><li>Detect anomalous changes in behavior or nascent money laundering patterns before they spread.</li><li>Reduce the false positive rate for the cases selected for manual review.</li></ul>
**Metrics/KPIs** | <ul><li>Annual alert volume</li><li>Cost per alert</li><li>False positive reduction rate</li></ul>
**Sample dataset** | https://s3.amazonaws.com/datarobot-use-case-datasets/DR_Demo_AML_Alert_train.csv    

### Problem framing {: #problem-framing }

The target variable for this use case is **whether or not the alert resulted in a SAR** after manual review by investigators, making this a binary classification problem. The unit of analysis is an individual alert&mdash;the model will be built on the alert level&mdash;and each alert will receive a score ranging from 0 to 1. The score indicates the probability of being a SAR.

The goal of applying a model to this use case is to lower the false positive rate, which means resources are not spent reviewing cases that are eventually determined not to be suspicious after an investigation.

In this use case, the false positive rate of the rules engine on the validation sample (1,600 records) is calculated as follows:

The number of `SAR=0` records divided by the total number of records = `1436/1600` = `89.75%`, or roughly `90%`.
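
The ratio above can be reproduced in a few lines of Python. The function name `false_positive_share` is an assumption made for this sketch. The label list is rebuilt from the counts in the text (1,436 non-SAR alerts out of 1,600, leaving 164 SAR alerts), not read from the dataset.

```python
def false_positive_share(sar_labels):
    """Fraction of alerts that investigators closed as non-SAR (label 0)."""
    if not sar_labels:
        raise ValueError("sar_labels must not be empty")
    return sar_labels.count(0) / len(sar_labels)


# Rebuild the validation sample from the counts quoted in the text.
labels = [0] * 1436 + [1] * 164
print(f"{false_positive_share(labels):.2%}")  # prints 89.75%
```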

### ROI estimation {: #roi-estimation }

ROI can be calculated as follows:

`Avoided potential regulatory fine + Annual alert volume * false positive reduction rate * cost per alert`

A high-level measurement of the ROI equation involves two parts.

1. The total amount of `avoided potential regulatory fines` will vary depending on the nature of the bank and must be estimated on a case-by-case basis.

2. The second part of the equation is where AI can have a tangible impact on improving investigation productivity and reducing operational costs. Consider this example:

    * A bank generates 100,000 AML alerts every year.
    * DataRobot achieves a 70% false positive reduction rate without losing any historical suspicious activities.
    * The average cost per alert is `$30~$70`.

    Result: The annual ROI of implementing the solution will be `100,000 * 70% * ($30~$70) = $2.1MM~$4.9MM`.
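
The ROI arithmetic above can be written as a small helper for trying other volumes, rates, or costs. The function name and parameters are assumptions made for this sketch. `avoided_fines` defaults to `0` because that term must be estimated separately for each bank.

```python
def estimate_annual_roi(annual_alerts, fp_reduction_rate, cost_per_alert,
                        avoided_fines=0.0):
    """Avoided fines + alert volume * false positive reduction rate * cost per alert."""
    return avoided_fines + annual_alerts * fp_reduction_rate * cost_per_alert


# Figures from the example above: 100,000 alerts, 70% reduction, $30 to $70 per alert.
low = estimate_annual_roi(100_000, 0.70, 30)
high = estimate_annual_roi(100_000, 0.70, 70)
print(f"${low:,.0f} to ${high:,.0f} per year")
```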

## Working with data {: #working-with-data }

The linked synthetic dataset illustrates a credit card company’s AML compliance program. Specifically, the model detects the following money-laundering scenarios:

- The customer spends on the card but overpays their credit card bill and seeks a cash refund for the difference.
- The customer receives credits from a merchant without offsetting transactions and either spends the money or requests a cash refund from the bank.

The unit of analysis in this dataset is an individual alert, meaning a rules-based engine is already in place and generates an alert whenever it detects potentially suspicious activity consistent with the above scenarios.

### Data preparation {: #data-preparation }

Consider the following when working with data:

* **Define the scope of analysis**: Collect alerts from a specific analytical window to start with; it’s recommended that you use 12–18 months of alerts for model building.

* **Define the target**: Depending on the investigation processes, the target definition could be flexible. In this walkthrough, alerts are classified as `Level1`, `Level2`, `Level3`, and `Level3-confirmed`. These labels indicate the level of the investigation at which the alert was closed; only `Level3-confirmed` alerts were confirmed as SARs. To create a binary target, treat `Level3-confirmed` as SAR (denoted by 1) and the remaining levels as non-SAR alerts (denoted by 0).

* **Consolidate information from multiple data sources**: Below is a sample entity-relationship diagram indicating the relationship between the data tables used for this use case.

![](images/aml-entity-rel.png)

Some features are static information, such as `kyc_risk_score` and `state of residence`, and can be fetched directly from the reference tables.

For transaction behavior and payment history, the information will be derived from a specific time window prior to the alert generation date. This case uses 90 days as the time window to obtain the dynamic customer behavior, such as `nbrPurchases90d`, `avgTxnSize90d`, or `totalSpend90d`.
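
A minimal sketch of this windowed aggregation for one customer follows. The input shape (a list of `(date, amount)` purchase pairs), the function name `window_features`, and the sample values are assumptions made for illustration. In practice the aggregation would typically run in SQL or a dataframe library over the transaction table.

```python
from datetime import date, timedelta


def window_features(purchases, alert_date, window_days=90):
    """Aggregate one customer's purchases in the window before the alert date."""
    start = alert_date - timedelta(days=window_days)
    # Keep only activity inside the window and strictly before the alert,
    # so nothing from the alert date onward leaks into the features.
    amounts = [amt for txn_date, amt in purchases if start <= txn_date < alert_date]
    total = sum(amounts)
    suffix = f"{window_days}d"
    return {
        f"nbrPurchases{suffix}": len(amounts),
        f"totalSpend{suffix}": round(total, 2),
        # Assumption: report 0.0 when the customer has no purchases in the window.
        f"avgTxnSize{suffix}": round(total / len(amounts), 2) if amounts else 0.0,
    }


purchases = [
    (date(2023, 1, 5), 20.00),   # older than 90 days, excluded
    (date(2023, 4, 2), 30.00),
    (date(2023, 5, 20), 50.00),
    (date(2023, 6, 15), 10.00),  # on the alert date, excluded
]
features = window_features(purchases, alert_date=date(2023, 6, 15))
print(features)
```

Anchoring the window to each alert's own generation date, rather than to a fixed calendar date, keeps every row's features limited to information that was available when the alert fired.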

Below is an example of one row in the training data after it is merged and aggregated (it is broken into multiple lines for easier visualization).

![](images/aml-training-row.png)
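
The binary target described under **Define the target** can be derived from the investigation level with a simple mapping. The sketch below assumes the level is available as a string per alert. The function name `to_binary_target` and the sample list are hypothetical.

```python
KNOWN_LEVELS = {"Level1", "Level2", "Level3", "Level3-confirmed"}


def to_binary_target(level):
    """Return 1 for alerts confirmed as a SAR, 0 for all other closing levels."""
    if level not in KNOWN_LEVELS:
        # Fail loudly on unexpected labels instead of silently treating them as non-SAR.
        raise ValueError(f"Unexpected investigation level: {level!r}")
    return 1 if level == "Level3-confirmed" else 0


closing_levels = ["Level1", "Level3", "Level3-confirmed", "Level2"]
sar = [to_binary_target(level) for level in closing_levels]
print(sar)  # prints [0, 0, 1, 0]
```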

### Features and sample data {: #features-and-sample-data }

The features in the sample dataset consist of KYC (Know-Your-Customer) information, demographic information, transactional behavior, and free-form text information from notes taken by customer service representatives. To apply this use case in your organization, your dataset should contain, at a minimum, the following features:

- Alert ID
- Binary classification target (`SAR/no-SAR`, `1/0`, `True/False`, etc.)
- Date/time of the alert
- "Know Your Customer" score used at the time of account opening
- Account tenure, in months
- Total merchant credit in the last 90 days
- Number of refund requests by the customer in the last 90 days
- Total refund amount in the last 90 days

Other helpful features to include are:

- Annual income
- Credit bureau score
- Number of credit inquiries in the past year
- Number of logins to the bank website in the last 90 days
- Indicator that the customer owns a home
- Maximum revolving line of credit
- Number of purchases in the last 90 days
- Total spend in the last 90 days
- Number of payments in the last 90 days
- Number of cash-like payments (e.g., money orders) in last 90 days
- Total payment amount in last 90 days
- Number of distinct merchants purchased from in the last 90 days
- Customer Service Representative notes and codes based on conversations with customer (cumulative)

The table below shows a sample feature list:

Feature name | Data type | Description | Data source | Example
------------ | --------- | ----------- | ----------- | -------
ALERT | Binary | Alert Indicator | tbl_alert | 1
SAR | Binary (Target) | SAR Indicator (Binary Target) | tbl_alert | 0
kycRiskScore | Numeric | Account relationship (Know Your Customer) score used at time of account opening | tbl_customer | 2
income | Numeric | Annual income | tbl_customer | 32600
tenureMonths | Numeric | Account tenure in months | tbl_customer | 13
creditScore | Numeric | Credit bureau score | tbl_customer | 780
state | Categorical | Account billing address state | tbl_account | VT
nbrPurchases90d | Numeric | Number of purchases in last 90 days | tbl_transaction | 4
avgTxnSize90d | Numeric | Average transaction size in last 90 days | tbl_transaction | 28.61
totalSpend90d | Numeric | Total spend in last 90 days | tbl_transaction | 114.44
csrNotes | Text | Customer Service Representative notes and codes based on conversations with customer (cumulative) | tbl_customer_misc | call back password call back card password replace atm call back
nbrDistinctMerch90d | Numeric | Number of distinct merchants purchased from in last 90 days | tbl_transaction | 1
nbrMerchCredits90d | Numeric | Number of credits from merchants in last 90 days | tbl_transaction | 0
nbrMerchCredits-RndDollarAmt90d | Numeric | Number of credits from merchants in round dollar amounts in last 90 days | tbl_transaction | 0
totalMerchCred90d | Numeric | Total merchant credit amount in last 90 days | tbl_transaction | 0
nbrMerchCredits-WoOffsettingPurch | Numeric | Number of merchant credits without an offsetting purchase in last 90 days | tbl_transaction | 0
nbrPayments90d | Numeric | Number of payments in last 90 days | tbl_transaction | 3
totalPaymentAmt90d | Numeric | Total payment amount in last 90 days | tbl_account_bill | 114.44
overpaymentAmt90d | Numeric | Total amount overpaid in last 90 days | tbl_account_bill | 0
overpaymentInd90d | Numeric | Indicator that account was overpaid in last 90 days | tbl_account_bill | 0
nbrCustReqRefunds90d | Numeric | Number of refund requests by the customer in last 90 days | tbl_transaction | 1
indCustReqRefund90d | Binary | Indicator that customer requested a refund in last 90 days | tbl_transaction | 1
totalRefundsToCust90d | Numeric | Total refund amount in last 90 days | tbl_transaction | 56.01
nbrPaymentsCashLike90d | Numeric | Number of cash-like payments (e.g., money orders) in last 90 days | tbl_transaction | 0
maxRevolveLine | Numeric | Maximum revolving line of credit | tbl_account | 14000
indOwnsHome | Numeric | Indicator that the customer owns a home | tbl_transaction | 1
nbrInquiries1y | Numeric | Number of credit inquiries in the past year | tbl_transaction | 0
nbrCollections3y | Numeric | Number of collections in the past 3 years | tbl_collection | 0
nbrWebLogins90d | Numeric | Number of logins to the bank website in the last 90 days | tbl_account_login | 7
nbrPointRed90d | Numeric | Number of loyalty point redemptions in the last 90 days | tbl_transaction | 2
PEP | Binary | Politically Exposed Person indicator | tbl_customer | 0
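
Before modeling, it can help to confirm that a dataset contains the minimum features listed earlier. The sketch below checks a CSV header against column names taken from the sample feature table. The `REQUIRED_COLUMNS` list and the `missing_columns` helper are assumptions made for this example, and the alert ID and alert date are omitted because the table does not name columns for them.

```python
import csv
import io

# Column names follow the sample feature table; adjust them to your own schema.
REQUIRED_COLUMNS = [
    "SAR",
    "kycRiskScore",
    "tenureMonths",
    "totalMerchCred90d",
    "nbrCustReqRefunds90d",
    "totalRefundsToCust90d",
]


def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from a CSV header row."""
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# A tiny in-memory CSV stands in for the real file in this example.
sample = io.StringIO("SAR,kycRiskScore,tenureMonths,totalMerchCred90d\n0,2,13,0\n")
header = next(csv.reader(sample))
print(missing_columns(header))
```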
